Abstract

The  On-Chip  Parallelism  of  VLSI  Circuits 

by  Mary  Lane  Bailey 


Chairperson  of  the  Supervisory  Committee:  Professor  Lawrence  Snyder 

Department  of  Computer  Science 


Simulation  is  a  bottleneck  in  VLSI  circuit  design.  Not  only  are  there  many  simulation 
runs  throughout  the  design  cycle,  but  each  run  can  take  hours  or  days  to  complete.  One 
often  suggested  means  of  speeding  up  event-driven  simulation  is  to  use  multiple  processors 
to  exploit  the  natural  parallelism  present  in  the  circuit,  that  is  to  partition  the  circuit 
among  multiple  processors,  with  each  executing  the  same  algorithm  on  its  portion  of  the 
circuit. This approach assumes that there is sufficient activity, or circuit parallelism, in
the circuit to keep all of the processors busy.

We have used two approaches in this work. First, we have formulated a model for
studying circuit parallelism and the potential speedup of parallel logic-level simulation.
Using this model, we have considered the effect of the choice of timing model and
synchronization strategy on speedup. We have also investigated the effect of circuit size on
parallelism.

Additionally, we have developed a methodology for measuring circuit parallelism, and
used it to determine the parallelism of nine circuits using two different simulators.
Empirical measurements have also been used to validate portions of the formal model.

The major results of the model are:


•  Unit-delay timing provides at least as much parallelism as variable-delay or
fixed-delay timing.

•  Asynchronous  timing  strategies  can  improve  simulation  speed  over  synchronous 
strategies.  However,  for  unit-delay  timing,  if  all  event  evaluation  times  are  equal, 
the  asynchronous  strategies  do  not  provide  additional  speedup  over  synchronous 
simulation. 

•  In  general,  the  percentage  of  parallelism  is  not  constant  over  circuit  size,  even  for 
members  of  the  same  circuit  family. 

We  have  used  empirical  results  to  validate  the  parallelism  results  for  variable-delay 
and unit-delay synchronous simulation. We also have empirical results that show that the
percentage  of  parallelism  changes  as  the  instance  size  changes  for  these  strategies. 

Finally, the empirical measurements have provided a set of benchmarks for the
parallelism of circuits. These measurements are remarkably low for variable-delay timing. Thus,
the  direct  application  of  circuit  activity  to  synchronous  parallel  simulation  for  variable- 
delay  timing  is  doomed!  It  may  be  feasible  for  unit-delay  timing,  or  for  variable-delay 
timing  using  asynchronous  strategies. 
Chapter 1


Introduction 


Simulation is the principal tool used in VLSI design to determine the correctness of a
circuit and to analyze its performance before fabrication. Because fabrication is so
expensive,  both  in  time  and  money,  it  is  essential  that  a  circuit  has  a  high  probability  of 
working  correctly  the  first  time.  However,  even  moderately  large  circuits  can  take  hours 
or  days  to  simulate. 

In  recent  years,  technology  has  allowed  larger  and  larger  circuits  to  be  placed  on  a 
single  chip,  a  trend  which  should  continue  in  the  near  future.  Unfortunately,  as  circuit 
size  increases,  so  does  simulation  time,  and  advances  in  simulation  have  not  kept  up 
with those in technology. Thus, circuit simulation, already a time-consuming part of the
design process, is becoming an increasingly severe bottleneck in VLSI design.

One way to decrease simulation time is to increase the abstraction level of the
simulator. Circuit-level simulators, for example SPICE [Nagel 75], solve the differential
equations  describing  the  state  of  all  structures  on  a  chip.  They  provide  detailed  timing 
information  and,  if  parametrized  with  the  correct  process  values,  are  widely  regarded  as 
reliable.  They  use  and  report  actual  voltage  values.  These  simulators  typically  can  only 
handle  relatively  small  circuits,  on  the  order  of  hundreds  of  transistors. 

Switch-level  simulators  idealize  a  transistor  as  a  switch  and  compute  the  resulting 
state of a circuit. Voltages are also abstracted to a small number of discrete values,
typically 0, 1, and X. These simulators can handle much larger circuits than
circuit-level simulators, including most circuits that can fit on VLSI chips today. They cannot
correctly simulate the analog characteristics of circuits, but do correctly model the
bidirectionality of transistors. Switch-level simulators come in two varieties: those with
timing  and  those  which  only  provide  functional  results. 

There  are  at  least  two  other  abstraction  levels  that  should  be  mentioned.  Logic-level 
or  gate-level  simulators  model  circuits  as  boolean  functions.  Here  the  bidirectionality 
of  transistors  is  lost,  so  if  this  is  essential  to  the  circuit’s  function,  this  abstraction 
level  will  not  yield  the  proper  results.  The  Yorktown  Simulation  Engine  [Denneau  83]  is 
an  example  of  a  gate-level  simulator.  Finally,  behavioral  or  functional-level  simulators 
represent  large  portions  of  the  circuit  by  a  single  model.  N.2  [Ordy  83]  is  an  example  of 
this  type  of  simulator. 

Though  simulating  at  higher  levels  of  abstraction  is  useful  during  the  design  process, 
it  is  usually  desirable  to  simulate  the  entire  chip  at  the  switch-level.  Some  designers 
do  this  early  in  the  design  process,  using  a  netlist  representation  of  the  chip.  Then 
the  final  layout  can  be  compared  to  this  netlist  by  using  a  verification  tool  such  as 
Gemini  [Ebeling  88]  to  ensure  correctness.  Others  do  this  just  before  fabrication,  using 
a  simulation  file  extracted  from  the  actual  chip  layout.  In  either  case,  the  entire  chip  is 
usually  simulated  at  the  switch-level  abstraction.  Because  the  entire  chip  is  simulated  at 
the  switch-level,  the  simulation  time  can  be  quite  long.  We  would  like  to  decrease  this 
time,  so  we  are  primarily  concerned  with  switch-level  simulation. 

Parallel simulation has often been suggested as a means of speeding up circuit
simulation. For switch-level, logic-level, and functional-level simulation, one common approach
is  to  partition  the  circuit  among  multiple  processors,  with  each  processor  executing  the 
same  algorithm  on  its  portion  of  the  circuit  [Smith  86].  For  synchronous  event-driven 
simulation,  this  has  several  implications.  First,  there  must  be  a  large  amount  of  circuit 
activity,  or  circuit  parallelism,  in  the  circuit.  Second,  there  must  be  a  good  partition 
which spreads out this activity. Finally, the overhead of communication and
synchronization must be reasonable.

We  focus  on  the  first  issue,  circuit  activity  or  circuit  parallelism.  Circuit  parallelism 
provides  an  upper  limit  on  the  potential  speedup  of  synchronous  parallel  simulation,  since 
this  is  the  average  number  of  events  that  can  be  executed  simultaneously,  assuming  an 
infinite  number  of  processors  and  no  cost  for  communication  and  synchronization.  If 
there is little parallelism, then the other issues are not important. No matter how good
the partition or how low the communication overhead, without sufficient parallelism, parallel
synchronous event-driven switch-level simulation is doomed.
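
As a rough illustration of this definition, the following Python sketch computes circuit parallelism from an event trace as the total number of events divided by the number of timesteps in which at least one event occurs. The function name and the trace format (a flat list of event timestamps) are assumptions made here for illustration, and the sketch assumes that every event takes the same time to evaluate.

```python
from collections import Counter


def circuit_parallelism(event_times):
    """Average number of events per active timestep.

    event_times is a hypothetical trace: one timestamp per simulator event.
    With unlimited processors, no communication or synchronization cost,
    and equal event evaluation times, each active timestep costs one unit
    of parallel time, so this ratio bounds the synchronous speedup.
    """
    events_per_step = Counter(event_times)
    if not events_per_step:
        return 0.0
    return len(event_times) / len(events_per_step)


# Six events spread over three active timesteps.
trace = [0, 0, 0, 1, 2, 2]
print(circuit_parallelism(trace))
```

A trace in which every event has its own timestamp yields a value of 1.0, that is, no speedup at all, regardless of how many processors are available.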

1.1  Timing  Models 

There are several different timing models that are commonly found in logic-level
simulators. Three are variable-delay, fixed-delay and unit-delay. Each model causes a different
simulation  strategy  to  be  employed.  Thus,  we  need  to  understand  these  models  in  order 
to  discuss  their  effects  on  circuit  parallelism. 

Variable-delay  simulators  generally  provide  the  most  reliable  timing  information  of 
the  three  models.  Each  simulator  event,  a  node  changing  value,  is  queued  with  a  specific 
delay  which  depends  on  both  the  circuit  topology  and  the  characteristics  of  the  current 
state  of  the  circuit.  There  is  a  wide  variability  in  the  delay  times  that  are  used,  and  in 
principle,  there  are  an  infinite  number  of  delays  available  for  use.  The  most  widely  used 
switch-level  timing  simulator  is  RNL. 
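
The queueing of events with individual delays can be sketched with a priority queue. This is a minimal illustration, not the mechanism of RNL or any other simulator named here; the function names, the tuple layout, and the delay values are all assumptions.

```python
import heapq


def schedule(queue, now, delay, node, value):
    """Queue a node change to take effect at time now + delay."""
    heapq.heappush(queue, (now + delay, node, value))


def drain(queue):
    """Pop all queued events in order of increasing event time."""
    order = []
    while queue:
        order.append(heapq.heappop(queue))
    return order


queue = []
# Hypothetical delays; in a variable-delay simulator they would be
# computed from the circuit topology and the current circuit state.
schedule(queue, 0.0, 2.5, "a", 1)
schedule(queue, 0.0, 0.7, "b", 0)
schedule(queue, 0.0, 1.1, "c", 1)
print([node for _, node, _ in drain(queue)])
```

Because the delays are drawn from an effectively continuous range, few events in such a queue share exactly the same timestamp.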

Fixed-delay  simulators  use  a  relatively  small  fixed  number  of  delays  in  the  circuit. 
These simulators are primarily gate-level simulators, with each gate type having identical
delays,  although  there  may  be  several  delays  per  gate.  For  example,  there  are  often 
different  rising  and  falling  delays  for  each  gate  type.  The  simulator  may  also  support 
multiple  gate  types  for  a  family  of  gates.  For  example,  there  may  be  several  AND  gate 
types,  representing  different  speeds.  The  delays  depend  on  the  dynamics  of  the  circuit 
only  to  the  extent  that  they  depend  on  the  node  values,  but  they  do  not  depend  on  the 
topology of the circuit. Lsim, a mixed gate and switch-level simulator, is an example
fixed-delay simulator [Chamberlain 86].

Figure 1.1: Comparing Unit-Delay and Pseudo Unit-Delay Timing

Unit-delay  simulators  provide  a  simple  delay  mechanism  at  the  expense  of  providing 
timing.  For  a  gate-level  simulation,  the  definition  of  unit-delay  timing  is  simple.  At 
each  timestep,  all  gates  whose  input(s)  have  changed  are  evaluated  using  the  current 
values  of  the  nodes,  and  then  all  of  the  resulting  new  outputs  are  updated  and  their 
gates  are  queued  for  the  next  timestep.  The  two  important  issues  here  are  that  every 
gate takes one timestep to change, and that all of the resulting node changes take effect
simultaneously. For switch-level, a similar algorithm is used, except a gate evaluation is
replaced by the evaluation of a transistor group, a set of nodes which are connected via
transistor sources and drains. MOSSIM II, SwitchSim, and COSMOS are examples of
switch-level  unit-delay  simulators. 

For  completeness,  we  also  discuss  pseudo  unit-delay  timing.  This  timing  model  is 
analogous  to  unit-delay  timing,  with  the  exception  that  node  changes  are  imposed  on  the 
circuit  as  soon  as  they  are  evaluated.  This  means  that  node  changes  within  a  timestep 
take  place  incrementally  instead  of  simultaneously.  Because  the  node  changes  take  place 
incrementally,  the  event  sequence  depends  on  the  order  that  events  are  placed  on  the 
queue,  and  in  some  cases,  this  can  affect  the  outcome  of  the  simulation.  For  example, 
consider the example circuit in Figure 1.1 [Bryant 81, Terman 83]. Assume that in the
initial state of the circuit the input is 0 and both outputs are 1. We want to change
the input to 1 and see how this affects the output. With unit-delay, the next state is 0
for both outputs since in the computation both previous outputs are used. The outputs
will then return to 1 and will continue to oscillate forever. In the pseudo unit-delay
algorithm, one output, say O1, will be evaluated first, and its value will be changed to
0. Then when O2 is evaluated, the value of O1 is 0, so the value of O2 doesn't change. This
provides a stable solution, with one output staying at 1 and the other one making a
transition to 0. Which output changes, however, depends on the evaluation order. If O2
is evaluated first, the output values are reversed.
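
The difference between the two timing models can be sketched in a few lines of Python. The figure is not reproduced here, so the sketch assumes the circuit is a pair of cross-coupled NAND gates sharing one input, which is consistent with the behavior described above (input 0 forces both outputs to 1). The function names and the output labels o1 and o2 are illustrative.

```python
def nand(a, b):
    return 0 if (a and b) else 1


def unit_delay_step(inp, o1, o2):
    """Evaluate both gates on the previous outputs, then update together."""
    return nand(inp, o2), nand(inp, o1)


def pseudo_unit_delay_step(inp, o1, o2, order=("o1", "o2")):
    """Evaluate gates one at a time; each change is visible immediately."""
    state = {"o1": o1, "o2": o2}
    for name in order:
        other = "o2" if name == "o1" else "o1"
        state[name] = nand(inp, state[other])
    return state["o1"], state["o2"]


# Input rises to 1 while both outputs are 1.
print(unit_delay_step(1, 1, 1))                       # both outputs fall
print(unit_delay_step(1, 0, 0))                       # and rise again
print(pseudo_unit_delay_step(1, 1, 1))                # o1 evaluated first
print(pseudo_unit_delay_step(1, 1, 1, ("o2", "o1")))  # o2 evaluated first
```

Under unit-delay the two outputs keep flipping together, while under pseudo unit-delay the result is stable but depends on which gate is evaluated first.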


1.2  Synchronization  Strategies 

Early parallel logic-level simulators were generally synchronous hardware simulation
engines [Blank 84, Denneau 83]. In these systems, all processors are synchronized at the
end  of  each  timestep.  If  the  circuit  activity  is  uneven,  then  some  processors  are  idle 
while  others  finish  their  computations.  Because  optimal  speedup  requires  the  activity 
to  be  spread  out  during  each  timestep,  as  opposed  to  spreading  out  activity  over  the 
entire  simulation  step,  partitioning  is  critical  and  poor  partitioning  can  greatly  reduce 
the  efficiency  of  the  simulation. 
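
The effect of partitioning under a synchronous strategy can be illustrated with a small calculation. In this sketch, which assumes unit event evaluation times and no communication cost, each timestep takes as long as its most heavily loaded processor. The function name and the load-table format are assumptions made for illustration.

```python
def synchronous_speedup(loads_per_step):
    """Speedup when all processors synchronize at the end of each timestep.

    loads_per_step[t][p] is the number of events processor p evaluates in
    timestep t (a hypothetical table produced by some partition).
    """
    sequential_time = sum(sum(step) for step in loads_per_step)
    parallel_time = sum(max(step) for step in loads_per_step if step)
    return sequential_time / parallel_time if parallel_time else 0.0


# Both partitions give each of two processors four events overall.
balanced_each_step = [[2, 2], [2, 2]]
balanced_only_overall = [[4, 0], [0, 4]]
print(synchronous_speedup(balanced_each_step))
print(synchronous_speedup(balanced_only_overall))
```

The second partition balances the work over the whole simulation but not within any timestep, so one processor is always idle and nothing is gained over sequential simulation.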

Asynchronous  strategies  attempt  to  reduce  the  synchronization  overhead  and  also  to 
relieve  the  global  event  queue  bottleneck.  In  asynchronous  algorithms,  each  processor  has 
its  own  simulation  clock  which  may  differ  from  the  simulation  clocks  of  other  processors. 
Even  though  the  clocks  are  distributed,  the  strategies  must  ensure  that  the  resulting 
simulation  produces  the  same  results  as  a  synchronous  simulation.  There  are  two  basic 
techniques  for  assuring  this:  conservative  and  optimistic  strategies. 

The conservative strategies were pioneered by Bryant [Bryant 77] and Chandy and
Misra  [Chandy  81].  Here  the  simulation  clock  can  only  proceed  if  it  is  sure  that  no 
other  events  will  arrive  with  timestamps  less  than  the  current  clock  time.  This  means 
that  it  “knows”  that  the  events  are  processed  in  the  correct  sequence.  The  problem 
with this strategy is that the system can deadlock since all processors may be waiting
for messages from other processors before continuing. Thus a deadlock detection and
recovery  scheme  is  necessary,  or  there  must  be  schemes  to  avoid  deadlock.  The  deadlock 
avoidance  mechanisms  generally  involve  passing  NULL  messages  to  ensure  that  time  can 
always  proceed. 
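
The conservative rule can be sketched as follows. The sketch assumes that each input channel delivers messages in nondecreasing timestamp order, so the smallest timestamp last seen on any channel bounds everything that can still arrive. The function name and data layout are illustrative, not taken from the cited papers.

```python
def safe_events(pending, channel_clocks):
    """Return the pending event times that can be processed safely.

    pending is a list of local event timestamps. channel_clocks maps each
    input channel to the timestamp of the last message received on it.
    """
    horizon = min(channel_clocks.values())
    return sorted(t for t in pending if t <= horizon)


pending = [3, 5, 9]
clocks = {"a": 5, "b": 2}
print(safe_events(pending, clocks))  # blocked: channel b is only at time 2

# A NULL message on channel b carries no event, only a timestamp promise.
clocks["b"] = 6
print(safe_events(pending, clocks))
```

Before the NULL message arrives, the processor cannot process any pending event, which is the waiting condition that can lead to deadlock if every processor is in the same position.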

Optimistic strategies were pioneered by Jefferson [Jefferson 85]. Here a processor
continues to process events, even if there may be later arriving events with smaller
simulation times. Periodically, the processor also checkpoints its state. When an event
arrives with a simulation time smaller than the current simulation clock (i.e., an event
arrives in the past), the processor rolls back its state to a time less than this time,
and cancels all erroneous events produced by the premature processing of events. The
rationale for this approach is that (1) some of the time this strategy will proceed correctly
whereas  the  conservative  strategy  would  idle,  and  (2)  the  time  it  will  spend  processing 
erroneous  events  would  be  spent  idling  in  the  conservative  strategy.  However,  there  is 
additional  overhead  in  this  strategy  required  to  checkpoint  the  state  (in  both  time  and 
space),  and  for  rollback  and  cancellation  of  erroneous  messages. 
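
A toy version of checkpointing and rollback is sketched below. It is a simplification under several assumptions: the class and method names are invented for illustration, the processor checkpoints after every event rather than periodically, event times are positive, the state update rule is arbitrary (chosen only so that event order matters), and cancellation of erroneous messages sent to other processors is omitted.

```python
class OptimisticProcessor:
    """Toy processor that processes events eagerly and rolls back on stragglers."""

    def __init__(self):
        self.clock = 0
        self.state = 0
        self.checkpoints = [(0, 0)]  # (time, state); the first entry is the initial state
        self.processed = []          # (time, value) events applied so far
        self.rollbacks = 0

    def _apply(self, time, value):
        self.state = 2 * self.state + value  # order-sensitive toy update
        self.clock = time
        self.processed.append((time, value))
        self.checkpoints.append((time, self.state))

    def receive(self, time, value):
        if time >= self.clock:
            self._apply(time, value)
            return
        # Straggler: restore the latest checkpoint taken before its time.
        self.rollbacks += 1
        while len(self.checkpoints) > 1 and self.checkpoints[-1][0] >= time:
            self.checkpoints.pop()
        self.clock, self.state = self.checkpoints[-1]
        undone = [e for e in self.processed if e[0] >= time]
        self.processed = [e for e in self.processed if e[0] < time]
        # Replay the undone events together with the straggler, in time order.
        for t, v in sorted(undone + [(time, value)]):
            self._apply(t, v)


p = OptimisticProcessor()
p.receive(1, 1)
p.receive(3, 1)
p.receive(2, 0)  # arrives in the past and triggers a rollback
print(p.state, p.rollbacks)
```

After the rollback, the state matches what in-order processing of the same three events would have produced, at the cost of the stored checkpoints and the repeated work.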

1.3  Contributions  of  This  Work 

In  this  dissertation  we  discuss  the  issues  of  circuit  parallelism  and  the  potential  of  parallel 
simulation,  focusing  on  logic-level  simulation  as  opposed  to  circuit-level  simulation.  We 
provide  a  formal  model  for  comparing  logic-level  parallel  simulation  using  three  timing 
strategies and three synchronization strategies. Assuming an infinite number of
processors, we show that for synchronous simulation, unit-delay timing provides the greatest
speedup,  and  that  for  a  given  timing  strategy,  the  conservative  asynchronous  strategy 
performs  better  than  the  synchronous  strategy.  We  also  show  that  for  a  fixed  number  of 
processors,  there  are  cases  where  the  optimistic  strategy  is  better  than  the  conservative 
strategy  and  vice  versa.  The  asynchronous  strategies  should  not  provide  a  great  increase 
in speedup for unit-delay timing, since if the event evaluation times are all equal, the
conservative asynchronous strategy provides the same speedup as the synchronous
strategy. We also use the formal model to show that using a synchronous strategy and a
given  circuit  family,  the  percentage  of  parallelism  may  change  as  the  size  of  the  circuit 
increases. 

In addition to the formal model, we provide a methodology for measuring circuit
parallelism. We use two simulators to demonstrate the methodology, and provide empirical
results  to  corroborate  the  model.  We  find  that  while  the  model  abstracts  many  specific 
characteristics  of  the  simulators,  it  still  provides  reasonable  results. 

Finally, we provide measurements that show how much parallelism is available in
VLSI circuits using a synchronous strategy. Since we do not expect asynchronous
techniques to provide markedly faster simulation times than the unit-delay measurements,
and  they  may  be  much  lower  for  variable-delay  simulation,  we  can  use  the  unit-delay 
measurements  to  estimate  the  potential  speedup.  These  measurements  range  from  35  to 
593.  These  figures  represent  the  speedup  one  can  obtain,  assuming  no  communication 
and  synchronization  cost,  perfect  partitioning,  and  an  infinite  supply  of  processors.  We 
believe that these measurements show that there is a small amount of parallelism
available for exploitation in parallel simulation, and only a small number of processors can
be  effectively  used  to  speed  up  sequential  logic-level  simulation. 


1.4  Thesis  Organization 

In  this  dissertation  we  first  present  the  theoretical  results,  and  follow  this  with  empirical 
data.  In  particular,  the  dissertation  is  organized  as  follows.  In  Chapter  2  we  provide 
a  summary  of  related  work.  Chapter  3  contains  the  formal  portion  of  the  thesis.  Here 
we  define  the  formal  model  for  considering  circuit  parallelism  and  use  it  to  investigate 
ways  of  speeding  up  logic-level  simulation.  The  remainder  of  the  dissertation  is  spent 
in  evaluating  the  model  via  empirical  results.  Chapter  4  lays  the  foundations  for  the 
empirical studies by describing a methodology for measuring circuit parallelism.
Chapters 5 through 7 then analyze portions of the model using empirical results. Finally, in
Chapter  8  we  conclude  and  discuss  future  work  in  this  area. 


